Multi-class classification algorithms for the diagnosis of anemia in an outpatient clinical setting

Anemia is one of the most pressing public health issues in the world, with iron deficiency a major contributor. The highest prevalence of anemia is in developing countries. The complete blood count (CBC) is a blood test used to diagnose anemia. While earlier studies have framed diagnosis as a binary classification problem, this paper frames it as a multi-class classification problem with three classes: mild, moderate and severe. These classes were chosen because the World Health Organization (WHO) guidelines formalize this categorization based on the haemoglobin (HGB) values of the chosen sample of patients in the CBC patient data set. The CBC test data was collected in an outpatient clinical setting in India. We used feature selection with majority voting to identify the key attributes in the input patient data set. In addition, since the original data set was imbalanced, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance it. Four data sets, including the original data set, were used in the experiments. Six standard machine learning algorithms were applied to the four data sets to perform multi-class classification. The algorithms were benchmarked, and the results tabulated, using both 10-fold cross validation and hold-out methods. The experimental results indicated that the multilayer perceptron network predominantly gave good recall values for the mild and moderate classes, which are the early and middle stages of the disease. With a good prediction model at the early stages, medical intervention can provide preventive measures against further deterioration into the severe stage, or recommend the use of supplements to overcome the problem.


Introduction
Anemia is a disease caused by the deficiency of iron, and it is one of the most critical health problems globally, constituting a serious public health issue [1]. According  demonstrated the utility and potential of artificial neural networks (ANN) in health care interventions [25]. Children under 5 years old and pregnant women are more vulnerable to anemia due to the greater requirement of iron for body growth and the expansion of red blood cells (RBC) [26]. Iron deficiency anemia results in a reduction in academic performance and work capacity, which reduces the earning potential of individuals and in turn affects national economic growth [27]. Alemayehu used multivariable logistic regression to study the magnitude and severity of anemia in a predominantly rural setting in Ethiopia. Mother's age, mother's occupation, gender of the child, dietary diversity and food security were found to be key factors influencing the prevalence of anemia in the community population studied [28]. Using complete blood count (CBC) samples, another study classified anemia with the machine learning algorithms Random Forest, C4.5 (decision tree) and Naïve Bayes (NB). The mean absolute error (MAE) and accuracy of the classifiers were computed, compared and tabulated [29]. Laengsri et al. [30] used KNN, decision tree (J48 algorithm), artificial neural networks (ANN) and support vector machines (SVM) to distinguish iron deficiency anemia from thalassemia in a study conducted in Thailand. This also resulted in the development of a web-based tool (ThalPred) for this computation [30]. Understanding the geographical distribution of anemia can help target prevention and control mechanisms; in a study done in Ethiopia, the spatial distribution was mapped along with determinant factors [31].
The spatial pattern of the rate of anemia among women of reproductive age was visualized and a spatially smoothed proportion obtained using empirical Bayes estimation methods for the spatial analysis [32].
There are three key computations that define the novelty of this research paper. This paper uses complete blood count (CBC) data to diagnose the level of anemia prevalent in the study population. Earlier studies have framed the diagnosis task as a binary classification problem: the patients in the study population are either anaemic or non-anaemic.
In this paper the diagnosis is framed as a multi-class classification problem. We use feature selection to identify key attributes of the patient data set. We have used majority voting on three feature selection techniques, namely correlation [33], Classification and Regression Tree (CART) [34] and gradient boosting [35], to identify the key attributes of the input patient data set.
We benchmark the performance of key machine learning algorithms for multi-class classification. The computations are done using both the 10-fold cross validation method and the hold-out method. The results of this analysis are tabulated as described further in this paper.
The rest of this paper is organized as follows. Section 2 describes the material and methods, Section 3 presents the data analysis, and Section 4 describes the feature selection method used in this paper. Section 5 presents the work flow, and Section 6 describes the techniques employed for anemia classification. Section 7 presents the performance assessment parameters used, while Section 8 presents the results and discussion. Finally, Section 9 presents conclusions and directions for further work.

Material and methods
We assess the prevalence of different types of anemia, including its severity and its association with the age and gender of the study population, with the CBC data set parameters as variables. We use data from complete blood count tests performed with a hematology analyzer to determine the prevalence of different types of anemia treated at the Eureka diagnostic center in Lucknow, India. All the procedures for the CBC test followed the standard operating protocols defined for the hematology analyzer.

Study subjects
The study subjects were patients who visited the Eureka diagnostic center, Lucknow, India for CBC investigation as part of various clinical examinations. The diagnostic center performs 4-8 CBC investigations a day on average. During the data collection period, between September 2020 and December 2020, 1000 CBC investigations were performed, out of which 400 samples were randomly selected to compute the prevalence of anemia and for further investigation into anemia classification. The data set is available on the Mendeley Data Repository, data identification number: 10.17632/dy9mfjchm7.1. Direct URL to data: https://data.mendeley.com/datasets/dy9mfjchm7/1.

Inclusion criterion
Hemoglobin values, represented by the HGB attribute in the CBC data set, were selected as the response variable. We included adult males and females who were not pregnant and were older than 15 years of age in the study population. Infants and young children less than 10 years old and pregnant women were excluded from the study because of factors such as variable CBC test values. After excluding these persons from the randomly chosen sample of 400 patients, 364 patients remained in the final data set to be investigated.

Procedures
Laboratory staff of the Eureka diagnostic center collected a 5 ml blood sample for each CBC test. The randomly selected patient samples in this study had HGB values between 4.2 and 19.6. The CBC report had 11 attributes, which are listed in Table 1 (Dataset Description).
These attribute values were recorded for each patient in the sample data set. The computation starts with an analysis of the distribution of anemia, as described in the S1 Appendix.

Ethical considerations
Ethical approval to collect and analyze patient data from the CBC test reports was granted by the Eureka diagnostic center, Lucknow, India, and patient consent was obtained. Support for the data collection and compilation was also obtained from the management and staff.

Data analysis
Of the eleven attributes in the Complete Blood Count (CBC) patient data set, the Hemoglobin (HGB) attribute is the response variable. The anemia status of an individual patient can be mild, moderate or severe, based on the value of the HGB variable. This categorization, for both men and women, is described in S1 Appendix and strictly follows the World Health Organization (WHO) guidelines for anemia categorization. The collected data was entered in Excel format for the eleven attributes, giving a data set of 364 records. Prominent feature selection techniques, namely correlation, CART and recursive feature selection using gradient boosting, were used to identify significant features. A brief description of these techniques is given in the Feature selection section. We performed majority voting over the three techniques. The Weka tool was used to compute the multi-class classification output, and the final results were tabulated [36]. Our data set was imbalanced; class imbalance creates a bias whereby a machine learning model tends to predict the majority class [37]. We used the Synthetic Minority Oversampling Technique (SMOTE) to balance the data set, using the Knime tool available at https://www.knime.com/downloads. The association of anemia prevalence with age and gender was computed. The prevalence of microcytic, normocytic and macrocytic anemia was computed along with its association with age. This is shown in Tables 2 and 3 along with Figs 1 and 2.
In addition, feature selection was performed using multiple techniques to identify the important features of the data set, which was then studied further using multi-class classification to determine the extent of anemia prevalence as mild, moderate or severe.
The final results obtained were compiled and prepared in the form of this research paper to report the findings of this study.
A total of 364 patient samples constituted the study population. We note that mild anemia is most prevalent in the 46-60 years age group, moderate anemia in the younger 10-30 years age group, and severe anemia in the 61-90 years age group.
Figs 1 and 2 show that mild anemia is least prevalent in the 61-90 years age group, moderate anemia in the 31-45 years age group, and severe anemia in the 46-60 years age group. The results of the data description are presented in Table 2. Normocytic anemia is most prevalent in the 10-30 years age group and least prevalent in the 61-90 years age group. Microcytic anemia is most prevalent in the 61-90 years age group and least prevalent in the 10-30 years age group. Macrocytic anemia is most prevalent in the 31-45 years age group and least prevalent in the 61-90 years age group. Within each age group, the most common type is normocytic anemia for 10-30 years and microcytic anemia for 31-45, 46-60 and 61-90 years. The second most common type is microcytic anemia for 10-30 years and normocytic anemia for 31-45, 46-60 and 61-90 years. Table 3 shows the prevalence of normocytic, normochromic and macrocytic anemia by age, while Table 4 shows the gender-wise prevalence of anemia in the study population. The overall prevalence of anemia in the data set is: mild anemia 70.32%, moderate anemia 25.28% and severe anemia 4.40%. Mild anemia is therefore the most prevalent class in the study population, and it is the most prevalent class in both females and males. Moderate anemia and severe anemia are both more prevalent in males than in females.
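The HGB-based categorization and the overall prevalence figures reported in this section can be illustrated with a short sketch. The exact cut-offs used in the study are given in the S1 Appendix, which is not reproduced here, so the thresholds below are an assumption based on the WHO haemoglobin cut-offs (in g/dL) for non-pregnant people aged 15 years and above. The function names `anemia_class` and `prevalence`, the extra "none" label and the sample readings are hypothetical.

```python
from collections import Counter

# Assumed WHO haemoglobin cut-offs in g/dL for non-pregnant people aged 15+.
# The study's own table is in its S1 Appendix; treat these as an assumption.
NORMAL_CUTOFF = {"M": 13.0, "F": 12.0}  # at or above this value: not anaemic
MILD_FLOOR = 11.0                       # mild: 11.0 up to the normal cut-off
SEVERE_CEILING = 8.0                    # moderate: 8.0-10.9, severe: below 8.0


def anemia_class(hgb, sex):
    """Return 'none', 'mild', 'moderate' or 'severe' for one HGB reading."""
    if hgb >= NORMAL_CUTOFF[sex]:
        return "none"
    if hgb >= MILD_FLOOR:
        return "mild"
    if hgb >= SEVERE_CEILING:
        return "moderate"
    return "severe"


def prevalence(labels):
    """Percentage of each severity class among the anaemic records."""
    counts = Counter(label for label in labels if label != "none")
    total = sum(counts.values())
    return {cls: round(100.0 * n / total, 2) for cls, n in counts.items()}


# Hypothetical (HGB, sex) readings, not taken from the study data.
records = [(12.4, "M"), (11.2, "F"), (9.5, "F"), (7.1, "M"), (14.8, "M")]
labels = [anemia_class(hgb, sex) for hgb, sex in records]
print(labels)
print(prevalence(labels))
```

Grouping the same labels by age band or by sex before calling `prevalence` would give the kind of breakdown reported in Tables 2 to 4.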

Feature selection
Feature selection methods are used to remove redundant features extracted from the raw data. The objective is to provide a better understanding of the data set and to allow faster analysis and classification. In the feature selection process, features are selected based on their ranking and their suitability for the classifier's performance. The main methods of feature selection are (1) filter, (2) wrapper and (3) embedded [38]. Filter methods apply statistical measures to assign a ranking score to each feature; according to their scores, features are either included in or excluded from the data set. Examples include Chi-square and Gain ratio. Wrapper methods select combinations of features (subsets) and assess their usefulness for a given machine learning algorithm, so that the subsets are scored on their predictive power. Embedded methods learn the best features according to the accuracy of the learning model; common embedded methods include decision tree algorithms such as CART, random forest and C4.5.
In this paper, the collected data was entered in Excel format for the eleven attributes mentioned in the Material and methods section (Section 2).
The following techniques were used for feature selection.
1. CART based feature selection: This widely used technique relies on a tree based classification algorithm that ranks the features by their ability to distinguish between classes. A few features from the top of the ranking are then selected according to the user's needs and criteria.
2. Correlation based feature selection: This popular filter based, multivariate technique selects features that are only marginally correlated, or not correlated, with one another. If two or more highly correlated features exist, it tries to select only one of them.
3. Recursive feature selection: This prominent technique is available in Scikit-Learn. It recursively selects the important features in each training phase and discards the less important ones.
Finally, we performed majority voting for feature selection over recursive feature selection using gradient boosting, correlation and CART. Feature selection procedure is shown in the Our analysis indicated that 7 attributes, namely Age, Sex, PCV, MCH, MCHC, PLT and HGB, are highly ranked. Hence, patient records with this reduced feature set are used for our experiments.
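The majority voting step can be sketched as follows, assuming each of the three selectors has already returned its own list of chosen attributes and that a feature is kept when at least two of the three selectors pick it. The voting threshold, the function name `majority_vote` and the per-method lists are assumptions made for illustration; the text does not list what each individual method selected.

```python
from collections import Counter


def majority_vote(selections, min_votes=2):
    """Keep features picked by at least `min_votes` of the selectors."""
    votes = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    return sorted(feature for feature, n in votes.items() if n >= min_votes)


# Hypothetical outputs of the three selectors; the real per-method picks
# are not given in the text, so these lists are for illustration only.
selections = {
    "correlation": ["Age", "Sex", "PCV", "MCH", "PLT"],
    "cart": ["PCV", "MCH", "MCHC", "Age", "RBC"],
    "rfe_gradient_boosting": ["PCV", "MCHC", "PLT", "Sex", "MCV"],
}
print(majority_vote(selections))
```

With these made-up inputs the vote keeps Age, MCH, MCHC, PCV, PLT and Sex, and drops the attributes that only one selector chose.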

Work flow and framework of the proposed model
The original data collected is referred to as dataset 1. Feature selection was then performed on dataset 1 using majority voting over recursive feature selection using gradient boosting, correlation and CART; we refer to the resulting data as dataset 2. Dataset 1 was imbalanced, hence the SMOTE technique was used to create dataset 3. We followed same
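The idea behind SMOTE, which is used here to balance dataset 1, can be shown with a minimal sketch: each synthetic record is placed on the line segment between a minority-class record and one of its nearest minority-class neighbours. This is a plain-Python illustration and not the Knime node used in the study; the function name `smote`, its parameters and the sample records are assumptions. It needs at least two minority records.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical severe-class records as (HGB, PCV) pairs, not study data.
severe = [[7.1, 22.0], [6.5, 20.1], [7.8, 24.3], [5.9, 19.0]]
new_points = smote(severe, n_new=6, k=3)
print(len(new_points))
```

Because every synthetic point is an interpolation of two real minority records, the new points stay inside the range of values already observed for that class.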

Performance assessment parameters
Performance assessment is an important step in developing a reliable and useful classifier. In our study we used five standard quality measures, namely accuracy, sensitivity, specificity, area under the curve (AUC) and the Kappa statistic, to assess the performance of the various classification techniques. These are defined as follows. Recall (sensitivity) is the proportion of actual positives that are correctly identified. Accuracy is the proportion of all instances, true positives and true negatives together, that are correctly identified.

\[ \text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}} \]
\[ \text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{True negative} + \text{False positive} + \text{False negative}} \]
AUC is used in classification analysis to determine which of the models predicts the classes best. The AUC lies between 0 and 1, and a perfect classifier attains the maximum value of 1.
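For a three-class problem, recall is computed per class by treating that class as the positive one, and accuracy is the share of all records on the main diagonal of the confusion matrix. The sketch below illustrates this under the assumption that rows are actual classes and columns are predicted classes; the function names and the matrix values are hypothetical and are not results from this study.

```python
def per_class_recall(cm):
    """Recall per class; cm[i][j] counts class-i records predicted as class j."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Share of all records that lie on the main diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Hypothetical confusion matrix (rows: actual, columns: predicted),
# class order: mild, moderate, severe. Not the paper's results.
cm = [
    [50, 2, 0],
    [3, 15, 1],
    [0, 2, 1],
]
for name, recall in zip(["mild", "moderate", "severe"], per_class_recall(cm)):
    print(f"{name}: recall = {recall:.3f}")
print(f"accuracy = {overall_accuracy(cm):.3f}")
```

In this made-up matrix the small severe class has a much lower recall than the overall accuracy suggests, which is why per-class recall is reported alongside accuracy for imbalanced data.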
The Kappa statistic corrects for those instances that may have been correctly classified by chance. It is a good measure for multi-class and imbalanced-class problems. It measures the agreement between the predicted and the actual classifications in a data set.

PLOS ONE
Using multi class classification in anemia diagnosis
Note that total accuracy is the observed agreement between the predicted and the actual classification in a data set [50].
Consider the following binary classification confusion matrix for two trials A and B:

            B: yes    B: no
A: yes        a         b
A: no         c         d

The data on the main diagonal (a and d) count the agreements between the two trials A and B, while the off-diagonal data (b and c) count the disagreements between them.
The observed proportionate agreement represents the total accuracy:
\[ \text{Total accuracy} = \frac{a + d}{a + b + c + d} \]
Random accuracy: assume we have two readers of the data, one for each trial. We define
\[ p_{\text{yes}} = \frac{a + b}{a + b + c + d} \cdot \frac{a + c}{a + b + c + d} \]
and similarly
\[ p_{\text{no}} = \frac{c + d}{a + b + c + d} \cdot \frac{b + d}{a + b + c + d} \]
The random agreement probability is the probability that the two readers (judges) in the two trials agreed on either yes or no by chance, i.e. \( p_{\text{yes}} + p_{\text{no}} \), also called the random accuracy.
The Kappa statistic is then computed as
\[ \text{Kappa} = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}} \]
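The Kappa computation above can be checked with a short sketch. It is written for a square confusion matrix of any size, so the binary case with counts a, b, c and d is the 2x2 special case; extending the chance-agreement sum to three classes is the usual Cohen's kappa generalisation and is an assumption here, since the text only spells out the binary case. The function name `kappa` and the counts are hypothetical.

```python
def kappa(cm):
    """Cohen's kappa for a square confusion matrix (rows: trial A, columns: trial B)."""
    n = sum(sum(row) for row in cm)
    k = len(cm)
    total_accuracy = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: for each class, P(A picks it) * P(B picks it), summed.
    random_accuracy = sum(
        (sum(cm[i]) / n) * (sum(row[i] for row in cm) / n) for i in range(k)
    )
    # Undefined (division by zero) when random_accuracy equals 1.
    return (total_accuracy - random_accuracy) / (1 - random_accuracy)


# Hypothetical binary counts a, b, c, d, chosen only for illustration.
a, b, c, d = 20, 5, 10, 15
print(round(kappa([[a, b], [c, d]]), 3))
```

For these counts the total accuracy is 0.7 and the random accuracy is 0.3 + 0.2 = 0.5, so Kappa is 0.4.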

Results and discussion
All the experiments were conducted by fine-tuning the hyperparameters of the machine learning algorithms used. The results are tabulated by data set. The experiments were carried out using hold-out and 10-fold cross validation. According to the literature survey, the six machine learning techniques used here perform well on similar types of problems. Moreover, visualization of the data set showed that our data contains a mixture of categorical and continuous values, and the techniques used in this paper give promising results for this type of data. In the hold-out method, the training and testing proportions are 80% and 20% respectively. Table 5 presents the results of the machine learning techniques tested on the original data set, whereas Table 6 presents the results on our feature selected data set. Table 5 shows that logistic regression (LR) gave the highest accuracy and Kappa value for both the hold-out and 10-fold cross validation methods. For the other metrics, such as recall, precision and AUC, different techniques gave promising results depending on the data set and methodology used.
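The two evaluation protocols used in these experiments, an 80%/20% hold-out split and 10-fold cross validation, can be sketched as index splits over the 364 records. This is an illustration only and not the Weka procedure used in the study: the splits below are simple shuffled splits without stratification by class, and the function names and seed are assumptions.

```python
import random


def hold_out_split(n, test_ratio=0.2, seed=0):
    """Shuffle record indices and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_ratio)
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


train, test = hold_out_split(364)
print(len(train), len(test))
print([len(fold_test) for _, fold_test in k_fold_splits(364)])
```

With 364 records the hold-out split leaves 291 records for training and 73 for testing, while each cross validation fold tests on 36 or 37 records and trains on the rest.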
For the Mild class, Table 5 shows dominant recall values of 100 percent and 90.2 percent, and AUC values of 98.5 percent and 97.6 percent, for the hold-out and 10-fold cross validation methods respectively. The Severe class is the critical stage of anemia, but because of the small number of samples in that class it yielded poor results. The results obtained using the feature selected data set in Table 6 show that, for the Mild class, Decision Tree (DT) gave the best result, with a recall of 91.5 percent, a precision of 95.6 percent and an AUC of 96.6 percent, when the hold-out method is used. Whereas in the case of 10-fold cross validation method, LR has given the recall value of 92.   As the data set has disproportionate numbers of samples for the three classes, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance it. Tables 7 and 8 provide the results of the experiments obtained using the SMOTE dataset. Table 7 indicates that MLP  The good results with this data set can be attributed to the class balance achieved with SMOTE and to the feature selection applied after SMOTE, which filtered out noise in the data set and improved the classification capabilities of the machine learning algorithms.
Regarding the performance of the machine learning algorithms in Table 5, MLP was predominant among all the techniques used. This can be attributed to the nature of MLP as a neural network: it was good at learning from the data for the majority classes. As the Severe class did not have enough samples to train a model well, MLP failed on that class. LR, which works well with continuous-valued features, performed better on the Severe class, as the data set contains a sufficient proportion of continuous values.
Table 6 shows that DT and RF gave competitively good performance compared with the other methods under the hold-out method. This can be attributed to the reduction of noise when feature selection is done. Under 10-fold cross validation, LR and MLP performed competitively well in mutually opposite classes compared with the other methods. In the case of Table 7, since the data set is balanced using SMOTE, the learnable features of the data set increased, giving good scope for the neural network based MLP technique. Hence, MLP consistently dominated the other algorithms irrespective of the mode of training (i.e., hold-out or 10-fold cross validation).
Based on the classes in the data set and their significance in the medical domain, we can summarize the results as follows:
• The Mild class is the starting stage of anemia, so it is crucial to identify patients at this stage and inform them as early as possible, so that precautionary measures can be taken to ensure that their anemia does not deteriorate rapidly. MLP helps in identifying Mild cases more efficiently, as it gave a good recall value for this class in most cases.

Conclusions and directions for further work
Anemia is one of the most prevalent diseases among women and children globally. The disease needs to be identified and treated in its earlier stages, since it can affect academic performance and the work output of adults, thus affecting a nation's economy and society. This paper addresses the problem of identifying anemia at its various stages with the help of different machine learning techniques that can accurately assign a patient to a class (stage of anemia). The Mild class (first stage) is the critical stage, at which identification makes it possible to avoid further advancement into deteriorating stages. MLP showed promising results for the classification of anemia. For the Moderate class, MLP demonstrated the best recall and AUC values. The Severe class, being the most advanced stage of anemia, has the least significance in classification, as most patients would have been informed of the disease by this stage. Yet there might be a few cases where the patient does not know about the existence of anemia at this stage. In this context, this paper tries to predict the existence of anemia at this stage accurately with the help of Decision Tree and MLP. In summary, the paper tries to predict the existence of anemia at various stages with the help of machine learning techniques that have proven to be accurate for our anemia data set. Data was collected in an outpatient clinical setting in India, and the prevalence of anemia by age and gender was computed. The study used three feature selection techniques with majority voting to identify the most significant features in the input data set. Various multi-class classification algorithms were used to diagnose anemia as mild, moderate or severe. SMOTE was used to deal with the class imbalance problem, since the original data set was imbalanced. In all, four data sets, including the original data set, were used in the experiments.
Performance benchmarking for the six machine learning algorithms used was done and tabulated using both the 10-fold cross validation and hold out methods.
A comparative analysis of the classification results obtained can be done by sourcing the Anemia patient data set from other locations and regions. It is planned to source this data set from Africa in the near future so that a comparative study of the results obtained can be performed.
Furthermore, using machine learning techniques to rank possible social determinants of anemia can be performed. The study can also be localised to a community setting whereby macro health indices relating to diet and nutrition can be computed. Accordingly, such interventions can be designed on the basis of the computed macro health indices which relate both to the nutrition and diet side of policy setting as well as other macro indices in the community such as financial indices and the broader socio-economic environment.
The computing of nutrition related indices like Diet diversity score, Food security score and malnutrition score can be used to design nutrition related public health interventions. The distribution of these scores in the community can help us design targeted interventions for the benefit of the community. In addition, scores like wealth index give an idea of the financial status of the community residents. The distribution of this score can help in designing policy measures on the social welfare side of policy planning. These computations can also help in designing suitable aggregated social and health score cards for the community.
Finally, a new hybrid model can be designed for classifying iron deficiency anemia that is based on the concepts of Deep learning, genetic algorithms and convolutional neural networks (CNN) using the same data sets.